5  Logistic Regression with Regularization

“Cricket is basically baseball on valium.” - Robin Williams

During the infamous 1932-33 Test series, Australia were in the line of fire as the England fast bowler Harold Larwood employed “bodyline” tactics to disrupt his opponents. One delivery struck Bill Woodfull near the heart, causing him to stagger; another hit Bert Oldfield on the head, leaving him concussed.

Logistic regression is frequently used to model binary outcomes, such as predicting the likelihood of a player winning a match or scoring above a certain threshold. However, as datasets grow in complexity, it is important to prevent overfitting and ensure the model generalizes well to new data. This is where regularization comes in. In this section, we will introduce logistic regression with regularization.

5.1 Introduction to Logistic Regression

Logistic regression is used when the dependent variable is binary. It models the probability of the outcome as a function of the predictors, where the output is transformed through the logistic function, ensuring predictions lie between 0 and 1.

The logistic function is given by:

\[ P(y = 1 | x) = \frac{1}{1 + e^{-(\beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p )}} \]

Where:

  • \(P(y = 1 | x)\) is the probability of the event occurring,
  • \(\beta_0\) is the intercept,
  • \(\beta_1, \ldots, \beta_p\) are the coefficients for the predictor variables \(x_1, \ldots, x_p\).
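A quick numerical check of this transformation takes only a few lines of R; the intercept, coefficients, and predictor values below are made up purely for illustration.

```r
# Logistic (sigmoid) function: maps any real-valued linear predictor into (0, 1)
logistic <- function(z) 1 / (1 + exp(-z))

# Hypothetical intercept, coefficients, and predictor values
beta0 <- -1.2
beta  <- c(0.8, 0.3)
x     <- c(1.5, 2.0)

# Predicted probability P(y = 1 | x)
p <- logistic(beta0 + sum(beta * x))
p
```

Whatever value the linear predictor takes, the output is always strictly between 0 and 1, which is what makes the logistic function suitable for modeling probabilities.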

5.1.1 Interpreting Coefficients

In logistic regression, the coefficients represent the relationship between each predictor variable and the log-odds of the outcome. However, interpreting the coefficients directly can be difficult because they are expressed in terms of the log-odds rather than a probability scale.

To understand the effect of each predictor on the outcome, we often transform the coefficients into odds ratios. The odds ratio for each predictor shows how the odds of the outcome change with a one-unit increase in the predictor variable, holding other variables constant.

5.1.2 The Log-Odds and Coefficients

The logistic regression model can be written as:

\[ \log\left(\frac{P(y = 1 | x)}{1 - P(y = 1 | x)}\right) = \beta_0 + \beta_1 x_1 + \cdots + \beta_p x_p \]

Where:

  • The left side represents the log-odds of the event occurring.
  • The right side includes the intercept \(\beta_0\) and the coefficients \(\beta_1,\ldots, \beta_p\) for the predictor variables \(x_1,\ldots, x_p\).

Each coefficient \(\beta_j\) represents the change in the log-odds of the outcome for a one-unit increase in the predictor \(x_j\). For example, if \(\beta_j\) is positive, it means that as \(x_j\) increases, the log-odds of the event occurring increases, which in turn increases the probability of the event.
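This additive effect on the log-odds is easy to verify numerically. A minimal sketch, using a hypothetical coefficient \(\beta_j = 0.4\):

```r
logistic <- function(z) 1 / (1 + exp(-z))
log_odds <- function(p) log(p / (1 - p))

beta0  <- -0.5   # hypothetical intercept
beta_j <-  0.4   # hypothetical coefficient for predictor x_j

# Probabilities at x_j = 2 and x_j = 3 (a one-unit increase)
p_lo <- logistic(beta0 + beta_j * 2)
p_hi <- logistic(beta0 + beta_j * 3)

# The log-odds change by exactly beta_j
log_odds(p_hi) - log_odds(p_lo)
```

Note that while the change in the log-odds is constant, the change in the probability itself depends on where on the logistic curve the starting value lies.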

Odds Ratios

To make the coefficients more interpretable, we exponentiate them to obtain odds ratios. The odds ratio for a predictor \(x_j\) is given by:

\[ \text{OR}_j = e^{\beta_j} \]

The odds ratio tells us how the odds of the outcome change for a one-unit increase in \(x_j\). Specifically:

  • If \(\text{OR}_j > 1\), a one-unit increase in \(x_j\) increases the odds of the outcome occurring.
  • If \(\text{OR}_j = 1\), a one-unit increase in \(x_j\) does not change the odds of the outcome occurring.
  • If \(\text{OR}_j < 1\), a one-unit increase in \(x_j\) decreases the odds of the outcome occurring.

Example Interpretation

Let’s assume we have a logistic regression model that predicts the likelihood of a cricket player winning a match, with predictor variables such as the player’s average score, number of wickets taken, and age. Suppose the coefficient for the player’s average score (\(\beta_1\)) is 0.5, so the exponentiated value of \(\beta_1\) (i.e., the odds ratio) is about 1.65.

  • Coefficient Interpretation: The coefficient of 0.5 means that for every 1-unit increase in the player’s average score, the log-odds of winning the match increase by 0.5. This suggests that higher scores improve the likelihood of winning.
  • Odds Ratio Interpretation: The odds ratio of 1.65 means that for each 1-unit increase in the player’s average score, the odds of winning increase by 65%. Thus, the higher the player’s average score, the higher the odds of winning.

It is important to note that the interpretation of each coefficient assumes that the other variables in the model are held constant. For example, the coefficient for the player’s age would be interpreted as the change in the log-odds of winning for a one-year increase in age, assuming that the player’s score and number of wickets remain constant.
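The numbers in this example are straightforward to reproduce; the coefficient of 0.5 is the hypothetical value assumed above.

```r
beta1 <- 0.5            # hypothetical coefficient for average score

odds_ratio <- exp(beta1)
odds_ratio              # approximately 1.65

# Percentage change in the odds per one-unit increase in average score
(odds_ratio - 1) * 100  # approximately 65
```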

5.2 Cricket Data

Cricket is a widely loved sport, with billions of fans around the world. It ranks second in global popularity, only behind football (soccer), and is particularly cherished in South Asia, Australia, Africa, and Europe.

5.2.1 The Rules of Cricket

Cricket is played on a rectangular pitch surrounded by an oval-shaped boundary. At either end of the pitch stands a vertical wicket, consisting of three wooden stumps topped with two small wooden blocks called bails. The game involves two teams, each with 11 players: one team bats while the other bowls.

The Bowling Team

The bowler’s job is to deliver a small leather ball toward the batter, aiming to get them out by knocking the bails off the stumps.

The Batting Team

The batting team’s objective is to score as many runs as possible without allowing the bowler to dislodge the bails. There are always two batters on the field at a time. After hitting the ball, the batters attempt to swap sides. Each batter continues their turn until they are out, at which point they are replaced by another teammate.

5.2.2 Scoring Runs

A run is scored when the two batters successfully switch places without getting out. For instance, if a batter hits the ball and manages to reach the opposite wicket and return safely, their team scores two runs. If the batter hits the ball to the boundary of the oval on a bounce, their team scores four runs. A six is awarded if the ball is hit over the boundary on the full (without bouncing).

5.2.3 Ways to Get Out

A batter can be dismissed in several ways, including:

  • Caught out: The batter hits the ball, and it is caught in the air by a fielder.
  • Run out: A fielder hits the stumps with the ball before the batter can reach the crease while attempting to score a run.
  • Bowled out: The bowler delivers a ball that goes past the batter and knocks the bails off the wicket.
  • Leg before wicket (LBW): The batter is hit in the legs by a ball that would have hit the stumps.

Cricket matches can vary in length, lasting anywhere from a few hours to multiple days, depending on the format. Because batters typically score many runs before getting out, the final scores are often quite high.

The asia_cup data set includes data from each cricket match played in all Asia Cup Tournaments from 1984 (the first one) to 2022. The Asia Cup is a tournament that now takes place every two years, alternating host cities in different countries throughout Asia.

5.3 Logistic Fit of the Cricket Data

In this section, we will use the cricket dataset to perform logistic regression and find the odds ratio for predicting whether a cricket player wins a match based on various predictors, such as the player’s average score, number of wickets, and age. We will leverage the tidymodels framework in R to carry out the analysis and interpret the results.

5.3.1 Preparing the Data

Let’s begin by loading the cricket dataset and inspecting the first few rows to understand its structure.

library(tidyverse)
library(tidymodels)

cricket_data <- read.csv("data/cricket_asia_cup.csv")

glimpse(cricket_data)
Rows: 252
Columns: 13
$ Team          <chr> "Pakistan", "Sri Lanka", "India", "Sri Lanka", "India", …
$ Opponent      <chr> "Sri Lanka", "Pakistan", "Sri Lanka", "India", "Pakistan…
$ Host          <chr> "Sharjah", "Sharjah", "Sharjah", "Sharjah", "Sharjah", "…
$ Year          <int> 1984, 1984, 1984, 1984, 1984, 1984, 1986, 1986, 1986, 19…
$ Toss          <int> 0, 1, 1, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 1,…
$ Selection     <int> 0, 1, 1, 0, 0, 1, 1, 0, 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 1,…
$ Run.Scored    <int> 187, 190, 97, 96, 188, 134, 116, 197, 94, 98, 132, 131, …
$ Fours         <int> 9, 11, 9, 7, 13, 5, 10, 14, 0, 4, 0, 0, 15, 11, 5, 10, 5…
$ Sixes         <int> 3, 1, 0, 0, 3, 0, 0, 3, 0, 0, 0, 0, 1, 4, 0, 0, 0, 1, 4,…
$ Extras.Scored <int> 21, 26, 14, 8, 17, 5, 14, 15, 9, 5, 19, 10, 9, 14, 7, 18…
$ Highest.Score <int> 47, 57, 51, 38, 56, 35, 34, 39, 37, 47, 44, 40, 57, 67, …
$ Result        <int> 0, 1, 1, 0, 1, 0, 0, 1, 0, 1, 0, 0, 1, 0, 0, 1, 0, 1, 0,…
$ Given.Extras  <int> 26, 21, 8, 14, 5, 17, 15, 14, 5, 9, 10, 19, 14, 9, 18, 7…

Before fitting a logistic regression model, we need to clean and prepare the data. This may involve:

  • Converting any categorical variables to factors (if necessary),
  • Handling missing values,
  • Creating a binary outcome variable (e.g., win or lose),
  • Splitting the data into training and test sets.

cricket_data <- cricket_data %>%
  mutate(Result = as.factor(Result)) %>%
  select(-Team, -Opponent, -Host, -Year)

set.seed(1004)
split <- initial_split(cricket_data, prop = 0.8)
train_data <- training(split)
test_data <- testing(split)

5.3.2 Specifying the Logistic Regression Model

Now that the data is prepared, we will specify a logistic regression model using the tidymodels framework.

log_reg_spec <- logistic_reg() %>%
  set_engine("glm") %>%
  set_mode("classification")


cricket_recipe <- recipe(Result ~ ., data = train_data) %>%
  step_normalize(all_predictors())

Next, we will combine the model specification and recipe into a workflow and fit the model to the training data.

log_reg_workflow <- workflow() %>%
  add_recipe(cricket_recipe) %>%
  add_model(log_reg_spec)

log_reg_fit <- fit(log_reg_workflow, data = train_data)

Once the model is fitted, we can examine the estimated coefficients and calculate the odds ratios for the predictors. We will generate predicted probabilities on the test data, then exponentiate the coefficients to obtain the odds ratios.

predictions <- predict(log_reg_fit, test_data, type = "prob")

coef_values <- tidy(log_reg_fit)
coef_values <- coef_values %>%
  mutate(odds_ratio = exp(estimate))  

coef_values
# A tibble: 9 × 6
  term          estimate std.error statistic p.value odds_ratio
  <chr>            <dbl>     <dbl>     <dbl>   <dbl>      <dbl>
1 (Intercept)      0.100     0.157     0.637  0.524       1.11 
2 Toss             0.359     0.162     2.21   0.0268      1.43 
3 Selection        0.147     0.170     0.863  0.388       1.16 
4 Run.Scored      -0.348     0.360    -0.966  0.334       0.706
5 Fours            0.447     0.291     1.54   0.124       1.56 
6 Sixes            0.270     0.196     1.38   0.169       1.31 
7 Extras.Scored   -0.139     0.181    -0.766  0.444       0.870
8 Highest.Score    0.678     0.277     2.44   0.0145      1.97 
9 Given.Extras     0.219     0.183     1.20   0.232       1.24 

5.4 Regularization in Logistic Regression

In this section, we will apply regularization techniques to improve the performance of the logistic regression model for predicting cricket match outcomes. Regularization helps reduce overfitting, especially when the number of predictors is large or the data is noisy. We will explore the three types of regularization introduced in the previous section:

  • L1 regularization (Lasso)
  • L2 regularization (Ridge)
  • Elastic Net (a combination of L1 and L2)
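In the parameterization used by the glmnet engine, all three can be written as one penalized estimation problem: minimize the average negative log-likelihood plus a penalty term, where \(\lambda\) (the penalty) controls the overall regularization strength and \(\alpha\) (the mixture) balances the two penalty types:

\[ \min_{\beta_0, \beta} \; -\frac{1}{n}\sum_{i=1}^{n} \left[ y_i \log p_i + (1 - y_i)\log(1 - p_i) \right] + \lambda \left( \alpha \sum_{j=1}^{p} |\beta_j| + \frac{1 - \alpha}{2} \sum_{j=1}^{p} \beta_j^2 \right) \]

where \(p_i = P(y_i = 1 \mid x_i)\). Setting \(\alpha = 1\) gives the Lasso, \(\alpha = 0\) gives Ridge, and intermediate values give the Elastic Net.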

5.4.1 Lasso Model Specification

In tidymodels, Lasso can be specified by setting the mixture argument to 1 in the model specification; the penalty argument controls the strength of the regularization.

log_reg_lasso_spec <- logistic_reg(penalty = 0.01, mixture = 1) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

log_reg_lasso_workflow <- workflow() %>%
  add_recipe(cricket_recipe) %>%
  add_model(log_reg_lasso_spec)

log_reg_lasso_fit <- fit(log_reg_lasso_workflow, data = train_data)

tidy(log_reg_lasso_fit)
# A tibble: 9 × 3
  term          estimate penalty
  <chr>            <dbl>   <dbl>
1 (Intercept)     0.0923    0.01
2 Toss            0.283     0.01
3 Selection       0.138     0.01
4 Run.Scored      0         0.01
5 Fours           0.262     0.01
6 Sixes           0.183     0.01
7 Extras.Scored  -0.144     0.01
8 Highest.Score   0.503     0.01
9 Given.Extras    0.131     0.01

5.4.2 Ridge Model Specification

To apply Ridge regularization, we set the mixture parameter to 0.

log_reg_ridge_spec <- logistic_reg(penalty = 0.01, mixture = 0) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

log_reg_ridge_workflow <- workflow() %>%
  add_recipe(cricket_recipe) %>%
  add_model(log_reg_ridge_spec)

log_reg_ridge_fit <- fit(log_reg_ridge_workflow, data = train_data)
tidy(log_reg_ridge_fit)
# A tibble: 9 × 3
  term          estimate penalty
  <chr>            <dbl>   <dbl>
1 (Intercept)     0.0942    0.01
2 Toss            0.311     0.01
3 Selection       0.154     0.01
4 Run.Scored     -0.126     0.01
5 Fours           0.346     0.01
6 Sixes           0.236     0.01
7 Extras.Scored  -0.159     0.01
8 Highest.Score   0.532     0.01
9 Given.Extras    0.188     0.01

5.4.3 Elastic Net Model Specification

log_reg_en_spec <- logistic_reg(penalty = 0.01, mixture = 0.5) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

log_reg_en_workflow <- workflow() %>%
  add_recipe(cricket_recipe) %>%
  add_model(log_reg_en_spec)

log_reg_en_fit <- fit(log_reg_en_workflow, data = train_data)
tidy(log_reg_en_fit)
# A tibble: 9 × 3
  term          estimate penalty
  <chr>            <dbl>   <dbl>
1 (Intercept)     0.0939    0.01
2 Toss            0.303     0.01
3 Selection       0.150     0.01
4 Run.Scored     -0.0655    0.01
5 Fours           0.305     0.01
6 Sixes           0.211     0.01
7 Extras.Scored  -0.155     0.01
8 Highest.Score   0.530     0.01
9 Given.Extras    0.163     0.01

Tuning the Penalty and Mixture Hyperparameters

The penalty and mixture hyperparameters control the regularization strength and the balance between Lasso and Ridge. Rather than fixing them at arbitrary values such as 0.01, both can be tuned with cross-validation, in the same way as for the regularized linear regression models discussed in the previous section.
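As a sketch of what this tuning might look like, the code below marks both hyperparameters with tune() and searches a regular grid with 5-fold cross-validation. To keep the example self-contained it uses the built-in mtcars data as a stand-in binary classification problem; with the cricket data you would instead reuse cricket_recipe, train_data, and the workflows defined above.

```r
library(tidymodels)

# Stand-in binary outcome: am (automatic vs. manual transmission)
cars <- mtcars
cars$am <- factor(cars$am)

# Mark both hyperparameters for tuning
en_tune_spec <- logistic_reg(penalty = tune(), mixture = tune()) %>%
  set_engine("glmnet") %>%
  set_mode("classification")

en_tune_workflow <- workflow() %>%
  add_recipe(recipe(am ~ mpg + wt + hp, data = cars) %>%
               step_normalize(all_predictors())) %>%
  add_model(en_tune_spec)

set.seed(1004)
folds <- vfold_cv(cars, v = 5)

# Regular grid over penalty and mixture (5 x 5 = 25 candidates)
en_grid <- grid_regular(penalty(), mixture(), levels = 5)

en_results <- tune_grid(en_tune_workflow, resamples = folds, grid = en_grid)

# Best penalty/mixture combination by area under the ROC curve
select_best(en_results, metric = "roc_auc")
```

The selected values can then be finalized into the workflow with finalize_workflow() and the model refitted on the full training set.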